We have analyzed genetic data for 326 microsatellite markers that were typed uniformly in a large multiethnic
population-based sample of individuals as part of a study of the genetics of hypertension (Family Blood Pressure
Program). Subjects identified themselves as belonging to one of four major racial/ethnic groups (white, African
American, East Asian, and Hispanic) and were recruited from 15 different geographic locales within the United States
and Taiwan. Genetic cluster analysis of the microsatellite markers produced four major clusters, which showed
near-perfect correspondence with the four self-reported race/ethnicity categories. Of 3,636 subjects of varying
race/ethnicity, only 5 (0.14%) showed genetic cluster membership different from their self-identified race/ethnicity.
On the other hand, we detected only modest genetic differentiation between different current geographic locales within
each race/ethnicity group. Thus, ancient geographic ancestry (which is highly correlated with self-identified
race/ethnicity), as opposed to current residence, is the major determinant of genetic structure in the U.S. population.
Implications of this genetic structure for case-control association studies are discussed.


Introduction 

From an evolutionary point of view, population stratification
(genetically distinct subgrouping) and admixture
(intermating between genetically distinct groups) are created
by human mating patterns. Geographical, social,
and cultural barriers have given rise to reproductively
isolated human populations, within which random drift
has produced genetic differentiation. Numerous recent
studies using a variety of genetic markers have shown
that, for example, individuals sampled worldwide fall
into clusters that roughly correspond to continental lines,
as well as to the commonly used self-identifying racial
groups: Africans, European/West Asians, East Asians,
Pacific Islanders, and Native Americans (Bowcock et al.
1994; Calafell et al. 1998; Rosenberg et al. 2002). One
significant consequence of population genetic structure
is confounding in case-control association studies. Because
of the unique political and social history of the
United States, genetic structure in the contemporary U.S.
population is extremely complicated. Most prominently,
the level of white admixture among African Americans
has been estimated at 10%-20% (Parra et al. 1998);
more complicated are Hispanic groups, which may have
European, Native American, and African ancestries that
vary regionally (Hanis et al. 1991). In addition, stratification
and admixture occur at finer levels. Such subtle
heterogeneity is not readily detected with a limited number
of genetic markers, yet its implications in biomedical
research may be important.

Epidemiologic designs that aim to detect associations
between alleles and disease by use of unrelated cases and
controls are popular because of their efficiency and the
ease of recruiting subjects. However, spurious associations
between a trait and random genetic loci may arise
as a result of subtle genetic structure (Lander and Schork
1994). The impact of confounding due to population
genetic structure in case-control studies has been debated
(Thomas and Witte 2002; Wacholder et al. 2002).

Tang et al.: Genetic Structure and Race/Ethnicity

In light of the number of case-control studies that are
being performed and planned, the above considerations
warrant a careful examination of genetic structure within
and between major population groups in the United
States. One major goal is to quantify the correspondence
between self-identified race/ethnicity (SIRE) and the major
genetic structure that exists in the U.S. population.
In addition, out of convenience or out of necessity, case
and control subjects are sometimes recruited from different
geographic regions, matching only at the level of
major racial group. An underlying assumption is the
relative homogeneity within a single SIRE group. The
validity of this assumption must be evaluated. Furthermore,
association studies among ethnically admixed
populations are particularly vulnerable to spurious association.
Although admixed groups have had relatively
low representation in the U.S. population in the past,
their representation is increasing. Whereas, historically,
geneticists have avoided studying such individuals and
groups because of the difficulties involved, it is no longer
reasonable or fair to exclude such groups from genetic
research.

In this study, we examined the genetic structure between
and within major racial/ethnic groups by use of
data from a large, ethnically diverse sample, the Family
Blood Pressure Program (FBPP), which includes self-identified
white, African American, Hispanic (Mexican), and
East Asian (Chinese and Japanese) subjects (FBPP Investigators
2002). Participants were enrolled, typically
as sibships or nuclear families, at 15 field centers (recruitment
sites), of which 11 are within the continental
United States, 1 is in Hawaii, and 3 are in Taiwan. Details
are provided in table A1 (online only). This sample provides
a unique opportunity to answer several questions
related to population structure. The degree of genetic
differentiation can be assessed for this sample with respect
to multiple levels of stratification.

Material and Methods 

Subjects 

The FBPP is a collaborative effort of four research networks
(GenNet, GENOA, HyperGEN, and SAPPHIRe)
that aims to investigate high blood pressure and related
conditions in multiple racial/ethnic groups (FBPP Investigators
2002). Each network has been funded by the
National Heart, Lung, and Blood Institute (NHLBI) since
1995. In total, DNA samples from 10,527 participants
were genotyped at 326 autosomal genome screen microsatellite
markers by the NHLBI-sponsored Mammalian
Genotyping Service (Marshfield, WI) (screening set 8)
and had sufficient marker data for analysis (i.e., at most
40 missing genotypes).

Race/ethnicity information was obtained by self-description.
HyperGEN focused their recruitment on whites
and African Americans. Subjects were given a response
card and were allowed to endorse any of the following
categories: “non-Hispanic white,” “non-Hispanic black,”
“Hispanic,” “Asian,” “Pacific Islander,” “American
Indian/Alaska Native,” or “other.”

GENOA concentrated their sampling on three groups:
whites, African Americans, and Hispanics. They also employed
a response card and allowed subjects to endorse
any of the following categories: “non-Hispanic white,”
“African American,” “Hispanic/Mexican,” or “other.”

GenNet focused their recruitment on white and African
American subjects. Participants were asked for a
self-description of their race/ethnicity without a list of
choices. Responses other than “Caucasian/white” or
“African American” (including “Hispanic”) were recorded,
but, in the pooled data set, they were listed as
“other.”

For all three of these networks, there were neither
questions nor requirements regarding the race/ethnicity
or ancestry of the participants’ parents or grandparents
for inclusion in the study. SAPPHIRe focused their study
on Asian populations. Specifically, they required subjects
to report being Chinese and having four Chinese grandparents
or being Japanese and having four Japanese
grandparents to be included in the study.

Thus, in summary, each study participant identified
him/herself as belonging to one of five categories: white
non-Hispanic (CAU), black non-Hispanic (AFR), Hispanic
(HIS), Chinese (CHI), and Japanese (JAP). Therefore,
in our analysis, SIRE corresponds to four major
distinctions: CAU, AFR, HIS, and EAS, the latter referring
to East Asians (Chinese and Japanese combined),
and one minor distinction, that between Chinese and
Japanese. In the first analyses, which involved computing
genetic distances and comparing SIRE with genetic structure
obtained from genetic cluster analysis, we randomly
selected one participant with STR genotype information
from each nuclear family and treated these participants
as unrelated individuals; the resulting set consisted of
3,648 individuals. Table A1 (online only) summarizes
the collection site and SIRE information of these individuals.
In total, this analysis included 1,349 self-identified
CAU, 1,308 AFR, 412 HIS, 407 CHI, 160 JAP, and
12 OTH. Three of the “others” came from HyperGEN
(one each from Salt Lake City, Minneapolis, and Framingham,
MA), eight came from GenNet (from Tecumseh,
MI), and one came from SAPPHIRe (from Honolulu).
The rate of missing genotypes was <2%.

Because of its focus on linkage analysis of hypertension,
the FBPP recruited sibships or nuclear families that
typically had at least one hypertensive index subject, although
precise ascertainment criteria varied among networks
(FBPP Investigators 2002). For analyses focusing
on genetic stratification bias with respect to blood pressure,
we selected the hypertensive individual (“case”)
from those families with a single hypertensive subject
and no other relatives and a single, randomly selected
hypertensive individual from families with multiple hypertensive
subjects and at most one normotensive subject.
To obtain “controls,” we selected the normotensive
subject from those families with a single normotensive
subject and no relatives and a single, randomly selected
normotensive individual from families with multiple
normotensive subjects and at most one hypertensive individual.
For the networks and field centers that included
only hypertensive subjects, this analysis was not possible.
If a family contained exactly one hypertensive subject
and one normotensive subject or more than one
hypertensive subject and more than one normotensive
subject, the family was not included in this analysis.
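
The selection rules above amount to a small decision table on the numbers of hypertensive and normotensive members in each family. The following is a minimal sketch of that table, not the FBPP procedure itself; the function name `select_subject` and the representation of a family as two lists of subject identifiers are assumptions made for illustration.

```python
import random


def select_subject(hypertensive_ids, normotensive_ids, rng=random):
    """Return ("case" | "control" | "excluded", chosen_id) for one family.

    hypertensive_ids / normotensive_ids are lists of (hypothetical) subject
    identifiers for the genotyped members of the family.
    """
    h = len(hypertensive_ids)
    n = len(normotensive_ids)

    # Case: a lone hypertensive subject, or several hypertensive subjects
    # with at most one normotensive relative (one is drawn at random).
    if (h == 1 and n == 0) or (h >= 2 and n <= 1):
        return "case", rng.choice(hypertensive_ids)

    # Control: the mirror-image rule for normotensive subjects.
    if (n == 1 and h == 0) or (n >= 2 and h <= 1):
        return "control", rng.choice(normotensive_ids)

    # Exactly one of each, or more than one of each: family not used.
    return "excluded", None
```

Because each family contributes at most one subject, the resulting cases and controls can be treated as unrelated individuals.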

Genetic Distance Analysis 

We created 18 subpopulations on the basis of the
participants’ SIRE and the recruitment site (the few individuals
who identified their race/ethnicity as “other”
were excluded from this analysis). As a measure of genetic
distance, we computed the “coancestry coefficient”
among groups (Reynolds et al. 1983). The coancestry
coefficient is a measure of distance that is closely related
to an average value of F_ST across genes. To visualize these
genetic distances, we performed multidimensional scaling
(MDS) analysis (Mardia et al. 1980). In simple terms,
this analysis provides a configuration of 18 points on a
two-dimensional plane, such that the Euclidean distances
among these points match the genetic distance matrix
as closely as possible.
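
As a rough illustration of these two steps, the sketch below computes a coancestry-type distance from allele frequencies and then embeds a distance matrix by classical (eigenvalue-based) MDS. It rests on stated assumptions: the distance uses the common large-sample simplification of the Reynolds et al. (1983) estimator, without sample-size corrections, and the MDS variant is classical scaling; the exact estimator and MDS algorithm used in the study are not specified here. The function names are hypothetical.

```python
import numpy as np


def reynolds_theta(freqs_a, freqs_b):
    """Simplified coancestry coefficient between two populations.

    freqs_a, freqs_b: one allele-frequency vector per locus (same loci,
    same allele order). Sample-size corrections are omitted (assumption).
    """
    numerator = 0.0
    denominator = 0.0
    for p, q in zip(freqs_a, freqs_b):
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        numerator += np.sum((p - q) ** 2)
        denominator += 1.0 - np.sum(p * q)
    return numerator / (2.0 * denominator)


def classical_mds(distances, n_dims=2):
    """Classical MDS: coordinates whose Euclidean distances approximate
    the given symmetric distance matrix."""
    d2 = np.asarray(distances, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering      # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(gram)       # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_dims]      # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale
```

Applied to an 18 x 18 matrix of pairwise distances, `classical_mds(matrix, n_dims=2)` would return one point per SIRE/site combination, which is the kind of configuration plotted in figure 1.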

Genetic Cluster Analysis 

In this analysis, we studied genetic similarity at an individual
level by use of the program structure (Pritchard
et al. 2000). This approach is similar to that of a previous
analysis (Rosenberg et al. 2002), except that the FBPP
population primarily represents a United States-based
sample. Because our goal is classification, we used the
“NOADMIX” option in structure, so that the entire
genome of each individual was assumed to have been
derived from a single homogeneous population. We examined
the correspondence rate between SIRE and genetic
cluster classification by cross-classifying subjects on
the basis of these two criteria.

Tests of Stratification 

To examine allele-frequency differentiation between
pairs of groups defined either by geography or by disease
status, we computed χ² tests of independence on the
basis of the 2 × 2 table of allele frequencies by group.
Levels of significance were determined empirically by
permutation analysis, with 10,000 permutations. For the
microsatellite markers, each distinct allele was tested,
provided that there were at least 50 occurrences of that
allele in the two tested groups combined. We used this
threshold to ensure adequate power to detect modest
differences, given the sample sizes employed. Because of
the small number of Chinese families recruited in Hawaii
(n = 25) and the small number of Japanese families recruited
in Stanford, CA (n = 16), these two field centers
were excluded from this analysis. Since all Japanese individuals
in this analysis are from Hawaii and all Hispanic
individuals are from Starr County, TX, comparison
between sites was not performed within these two
SIRE categories.
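
The per-allele test can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' code: each individual is summarized by the number of copies (0, 1, or 2) of the allele being tested, group labels are permuted across individuals, and the empirical P value uses the (hits + 1)/(permutations + 1) convention. All function names are hypothetical.

```python
import numpy as np


def chi2_2x2(table):
    """Pearson chi-square statistic for a 2 x 2 table of counts."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    if np.any(expected == 0):
        return 0.0  # allele absent (or fixed): nothing to test
    return float(np.sum((table - expected) ** 2 / expected))


def allele_table(copies, groups):
    """Rows: group 0 / group 1. Columns: tested allele / all other alleles."""
    copies = np.asarray(copies)
    groups = np.asarray(groups)
    rows = []
    for g in (0, 1):
        c = copies[groups == g]
        rows.append([c.sum(), 2 * c.size - c.sum()])
    return np.array(rows)


def permutation_p_value(copies, groups, n_perm=10000, seed=0):
    """Empirical P value obtained by shuffling group labels (assumed scheme)."""
    rng = np.random.default_rng(seed)
    observed = chi2_2x2(allele_table(copies, groups))
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(groups)
        if chi2_2x2(allele_table(copies, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

The 50-occurrence threshold described above would correspond to skipping any allele for which `np.sum(copies) < 50` before calling `permutation_p_value`.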

Results 

Genetic Distance Analysis 

In table 1, the diagonal elements represent the mean
(SD) of genetic distances between recruitment sites within
a SIRE group; the corresponding figures across SIRE
groups are indicated by the off-diagonal elements. The
greatest genetic distances occur between populations
with ancestries from different continents and little mixing
(i.e., between East Asians and African Americans,
followed by East Asians and whites). The second largest
genetic distances are between the groups with some shared
ancestry, namely, East Asians and Hispanics (whose
Native American ancestry resembles that of Asians) and
whites and African Americans (who have white admixture).
Most similar are whites and Hispanics (who have
substantial white admixture) and Chinese and Japanese.
As can be seen by comparing the genetic distances on
and off the diagonals in table 1, continental ancestry
and separation time play more-important roles than current
geographic distance. Thus, for example, Hawaiian
Chinese bear much more genetic resemblance to Chinese
from Stanford, CA, and from Taiwan than they do to
Hawaiian Japanese. In fact, the genetic distances between
recruitment sites within SIRE categories are uniformly
very small.

The MDS analysis for all 18 SIRE/site combinations
is shown in figure 1A. As we expect, subpopulations of
the same SIRE tend to cluster closely. Essentially, the X-axis
separates the East Asians from the other groups,
whereas the Y-axis separates the African Americans from
the other groups. The MDS places the Hispanic group
between the white cluster and the East Asian cluster,
which is consistent with this being an admixed group
with European and Native American ancestries and with
Native Americans being closer, genetically, to the East
Asians (Calafell et al. 1998). Although the Chinese and
the Japanese groups appear clustered together in this
plot, they are separable on another dimension. In other
words, MDS with only the Asians produces excellent
separation between the Chinese and the Japanese groups
(fig. 1B).

Table 1

Average Genetic Distances (× 10^-2) between SIRE/Site Pairs

        Average Genetic Distance (SD) between Pair
        CAU          AFR          HIS          CHI          JAP
CAU     .07 (.05)    2.90 (.13)   1.05 (.05)   4.20 (.12)   4.26 (.16)
AFR                  .01 (.006)   2.88 (.09)   4.62 (.10)   4.67 (.16)
HIS                                            3.09 (.01)   3.03 (.16)
CHI                                            .02 (.02)    .60 (.06)
JAP                                                         .00

Note: Genetic distances were calculated by use of the coancestry
coefficient of Reynolds et al. (1983).

Figure 1 MDS of the genetic distance matrix for 18 SIRE/site
combinations (A) and 7 East Asian SIRE/site combinations (B).
[Panel A: all SIRE/site combinations; horizontal axis: first dimension.]

Genetic Clusters versus SIRE

Genetic cluster analysis using structure was performed,
allowing, sequentially, for k = 2, 3, 4, or more clusters
(Pritchard et al. 2000). The results can be summarized
as follows. When k = 2 clusters was specified, the Chinese
and Japanese emerged as a combined cluster; when
k = 3 clusters was specified, the African Americans separated
from the whites and Hispanics; when k = 4 clusters
was specified, an additional cluster was formed that
was nearly exclusively Hispanic (99.8%). All but one of
the Hispanic individuals analyzed were included in this
new cluster. The four-cluster results are given in table 2,
with cross-classification by SIRE. Our sequential cluster
results are completely consistent with what we observed
from the genetic distance measures and from figure 1,
namely, that the East Asians are the most distant from
the other groups, followed by the African Americans,
and then the Hispanics. Allowing for more than four
clusters did not yield stable results: multiple runs of structure
produced varying cluster configurations; in many
runs, one cluster was nearly empty. However, when we
repeated the cluster analysis with only the East Asian
subjects, two clusters did emerge that almost perfectly
distinguished between the two ethnicities, with a total
of 6 (2 Chinese and 4 Japanese) (1.1%) of 567 subjects
being differentially classified. No such consistent subclusters
emerged from separate analyses of the African
American, white, or Hispanic groups. Thus, the structure
we observed at the population level using MDS is recaptured
here at an individual level. For the group reporting
a major SIRE category, the correspondence between
genetic cluster and SIRE is remarkably high, with
only 5 (0.14%) of 3,636 individuals being differentially
classified (table 2). Accordingly, in this case, major SIRE
category and genetic cluster are effectively synonymous.
Overall, our cluster analysis results are completely consistent
with previous theoretical predictions regarding
the ease of separating these groups on the basis of the
number of markers tested (Risch et al. 2002). Nearly all
individuals had a cluster assignment probability of ~1.
Only two subjects had a probability <.95: one of these
subjects self-reported as Hispanic but fell into the white
genetic cluster, and the other subject self-reported as African
American but fell into the white genetic cluster.
We note that this analysis was not based on determination
of individuals’ “racial” ancestry (e.g., estimating
individual European, African, and Native American ancestry
for the African American and Hispanic subjects).
To do so would require inclusion of the nonadmixed
ancestral groups (such as Africans and Native Americans)
and the use of the “ADMIX” option of structure.
What our results do show is that the (admixed) groups
included have approximated within-group random mating
for long enough to give rise to distinct genetic
clusters.

Table 2

Results of Genetic Cluster Analysis versus SIRE for Entire Sample

        No. of Subjects in Genetic Cluster
SIRE    A        B      C        D
CAU     1,348    0      0        1
AFR     3        0      1,305    0
HIS     1        0      0        411
CHI     0        407    0        0
JAP     0        160    0        0
OTH     1        2      0        9
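
The discordance figure quoted in this section can be rechecked directly from the counts in table 2. The short sketch below does so under one stated assumption: for each major SIRE category, the cluster holding the most subjects is taken as the "matching" cluster, and subjects in any other cluster are counted as differentially classified. The "other" (OTH) row is left out, as in the text.

```python
# Counts from table 2 (major SIRE categories only; OTH excluded).
table2 = {
    "CAU": {"A": 1348, "B": 0, "C": 0, "D": 1},
    "AFR": {"A": 3, "B": 0, "C": 1305, "D": 0},
    "HIS": {"A": 1, "B": 0, "C": 0, "D": 411},
    "CHI": {"A": 0, "B": 407, "C": 0, "D": 0},
    "JAP": {"A": 0, "B": 160, "C": 0, "D": 0},
}


def discordance(table):
    """Return (number outside the modal cluster, total subjects)."""
    total = 0
    discordant = 0
    for counts in table.values():
        n = sum(counts.values())
        total += n
        discordant += n - max(counts.values())  # subjects not in modal cluster
    return discordant, total


discordant, total = discordance(table2)
rate_percent = 100.0 * discordant / total
print(discordant, total, round(rate_percent, 2))  # 5 3636 0.14
```

Chinese and Japanese subjects share cluster B here, so the four-cluster analysis treats them as one East Asian group, in line with the four major SIRE distinctions.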

There were 12 individuals who reported “other” in
response to the race/ethnicity question. Of these individuals,
nine were classified genetically in the Hispanic
cluster, two in the East Asian cluster, and one in the
white cluster. Eight of the nine subjects who fell into the
Hispanic cluster were from GenNet (Tecumseh, MI), a
site where the recruitment focused on whites. Tracing
back to the original interview records, we found that, in
fact, all eight subjects self-reported as “Hispanic” but
were categorized as “other” when included in the pooled
data set.

Our study deliberately sampled whites, African Americans,
Hispanics, and East Asians; therefore, a more general
survey would likely have produced a larger representation
of individuals with other self-descriptions (e.g.,
Native Americans, Pacific Islanders, and South Asians).
Nonetheless, our results do reflect an unbiased sampling
of individuals who self-describe within the major categories
we included.

Stratification by Geography 

We tested for differences in the frequency of alleles at
each of the 326 microsatellite (STR) markers between
subpopulations defined by SIRE and recruitment site.
Table 3 displays the proportion of tests that were significant
at the P = .05 level. Stratification across SIRE
groups is uniformly high, with ≥40% of allele-frequency
differences significant. The one exception, as expected,
is the Chinese-Japanese comparison, involving two East
Asian ethnicities, for which the proportion that are significant
is ~18%. Perhaps of greater interest are the comparisons
within a SIRE group, which are indicated by
the diagonal elements in table 3. Here, we see only a
modest increase of significant tests over expected (5.3%
for AFR and 6.3% for CAU). Thus, stratification within
SIRE groups on the basis of current geography may lead
to confounding, but the lack of significant geographic
differences in allele frequencies suggests that the impact
is not likely to be large.

Table 3

Allele-Frequency Difference between SIRE/Site Combinations

        Proportion of Tests (± SE) Significant at P = .05
        CAU            AFR            HIS            CHI            JAP
CAU     .063 (±.008)   .576 (±.062)   .414 (±.079)   .493 (±.059)   .566 (±.047)
AFR                    .053 (±.006)   .640 (±.036)   .554 (±.065)   .642 (±.018)
HIS                                                  .482 (±.077)   .557
CHI                                                  .047 (±.005)   .182 (±.034)

Note: On average, 1,660 alleles were tested between each pair of SIRE/site
combinations. SEs are estimated on the basis of SIRE/site combinations.

Tests of Stratification in Comparisons of Hypertensive
Subjects with Normotensive Subjects

To examine whether stratification also arises between
hypertensive and normotensive subjects in the FBPP data, we selected
“cases” (hypertensive subjects) and “controls” (normotensive
subjects) in accordance with a scheme described
in the “Material and Methods” section. We then tested
for differences in the frequency of alleles at each of the
326 microsatellite markers between the “cases” and “controls”
and calculated the proportion of tests significant
at the P = .05 level. We saw no trend toward an excess
of significant tests (table 4). We also examined Q-Q plots
of the entire distribution of P values for the alleles at
the 326 markers and compared this distribution with the
expected uniform distribution. None of these plots revealed
any significant deviations from expectation. Thus,
it appears that, at least in the context of these analyses
of hypertension, sampling hypertensive cases and controls
from the same local population does not create a
serious confounding problem.

Because the study sample was largely based on the presence
of hypertension, and hypertension is age related,
age might also be acting as a confounder, if allele frequencies
are age dependent. We therefore also undertook
an analysis to determine whether there was genetic stratification
in the sample on the basis of age, particularly
in the admixed groups (African Americans and Mexican
Americans). Each race/ethnicity group was divided in half
at the median age (which ranged from 50 years to 58
years), and allele frequencies were compared between
the two age groupings for each allele. Examination of
Q-Q plots of the distribution of P values from this
analysis also showed near-perfect conformity with expectation,
a result that suggests no age trends in allele
frequencies.

Table 4

Test of Stratification between Unrelated Normotensive Subjects and Hypertensive
Subjects, for Various SIRE/Site Combinations

SIRE and              No. of Subjects               Proportion of          No. of Alleles
Recruitment Site      Normotensive   Hypertensive   Significant Alleles*   Tested
AFR:
  Birmingham, AL      35             368            .055                   1,799
  Forsyth, NC         49             149            .058                   1,055
  Jackson, MS         61             389            .042                   1,753
  Maywood, IL         164            55             .048                   1,173
CHI:
  Taiwan              72             156            .044                   1,160
HIS:
  Starr, TX           175            114            .057                   1,375
CAU:
  Tecumseh, MI        216            27             .043                   1,265

* Proportion of alleles with frequencies that are significantly different at the level of
P = .05.
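
The two summaries used in these stratification tests, the proportion of alleles significant at P = .05 and the Q-Q comparison against a uniform distribution, can be sketched as below. This is an illustration only: the plotting positions (i - 0.5)/n are one common convention and are an assumption here, and the function names are hypothetical.

```python
import numpy as np


def proportion_significant(p_values, alpha=0.05):
    """Fraction of tests with P <= alpha; about alpha is expected under
    the null hypothesis of no stratification."""
    p = np.asarray(p_values, dtype=float)
    return float(np.mean(p <= alpha))


def qq_points(p_values):
    """Return (expected, observed) quantiles for a uniform Q-Q plot.

    Points close to the diagonal expected == observed indicate that the
    P values behave as they would in the absence of stratification.
    """
    observed = np.sort(np.asarray(p_values, dtype=float))
    n = observed.size
    expected = (np.arange(1, n + 1) - 0.5) / n  # assumed plotting positions
    return expected, observed
```

Plotting `observed` against `expected` (often on a -log10 scale) gives the kind of Q-Q plot described above; an excess of small P values would appear as points departing from the diagonal.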

Discussion 

Attention has recently focused on genetic structure in the
human population. Some have argued that the amount
of genetic variation within populations dwarfs the variation
between populations, suggesting that discrete genetic
categories are not useful (Lewontin 1972; Cooper
et al. 2003; Haga and Venter 2003). On the other hand,
several studies have shown that individuals tend to cluster
genetically with others of the same ancestral geographic
origins (Mountain and Cavalli-Sforza 1997; Stephens
et al. 2001; Bamshad et al. 2003). Prior studies
have generally been performed on a relatively small number
of individuals and/or markers. A recent study (Rosenberg
et al. 2002) examined 377 autosomal microsatellite
markers in 1,056 individuals from a global sample
of 52 populations and found significant evidence of
genetic clustering, largely along geographic (continental)
lines. Consistent with prior studies, the major genetic
clusters consisted of Europeans/West Asians (whites),
sub-Saharan Africans, East Asians, Pacific Islanders, and
Native Americans. It is clear that the ability to define
distinct genetic clusters depends on the number and type
of markers used (Risch et al. 2002). Reports that document
inability to define distinct clusters generally used
only a modest number of markers and, hence, had little
power to detect clusters (Romualdi et al. 2002). Studies
with larger numbers of markers appear to show strong
evidence of clustering (Stephens et al. 2001; Rosenberg
et al. 2002).


Another major point of discussion has been the correspondence
between genetic clusters and commonly
used racial/ethnic labels. Some have argued for poor
correspondence between these two entities (Lewontin
1972; Wilson et al. 2001), whereas others have suggested
a strong correlation (Risch et al. 2002; Burchard
et al. 2003). We have shown a nearly perfect correspondence
between genetic cluster and SIRE for major
ethnic groups living in the United States, with a discrepancy
rate of only 0.14%. Perhaps this is not surprising
for the major groupings (whites, East Asians, and African
Americans), since prior studies would suggest enough
genetic differentiation between these groups to produce
robust clustering. On the other hand, one prior study
of Hispanics did not suggest a distinct cluster for this
group, possibly because of the heterogeneous origins of
that Hispanic sample (Stephens et al. 2001). From the
genetic perspective, Hispanics generally represent a differential
mixture of European, Native American, and
African ancestry, with the proportionate mix typically
depending on country of origin. Our sample was from
a single location in Texas and was composed of Mexican
Americans. Although the genetic distance analysis suggested
relative proximity to the whites in our sample,
the distance was still sufficient to allow for creation of
a distinct genetic cluster for this group. Again, this is
likely because of the large number of markers used in
our analysis. On the other hand, in the analysis of the
full sample, the two East Asian groups (Chinese and
Japanese) did not emerge as distinct subgroups, likely
because their distance from one another was too modest
to be detectable in the context of the larger sample.
However, when the East Asians were analyzed separately,
two clusters, corresponding to Chinese and Japanese,
did emerge, with only a small amount of discordance
(6 [1%] of 567 subjects). In contrast, cluster
analysis within the three other major clusters did not
produce robust, replicable subgroups, indicating a lack
of further subgroups within these entities, at least in the
current marker set. This observation does not eliminate
the potential for confounding in these populations. First,
there may be subgroups within the larger population
group that are too small to detect by cluster analysis.
Second, there may not be discrete subgrouping but continuous
ancestral variation that could lead to stratification
bias. For example, African Americans have a continuous
range of European ancestry that would not be
detected by cluster analysis but could strongly confound
genetic case-control studies. Furthermore, our analysis
likely underrepresents individuals with recent mixed ancestry
(who would require more complex categorization)
and other groups typically underrepresented, such as
South Asians. Further study is required to evaluate the
correlation between genetically determined groupings
and SIRE for these individuals.
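
To make the point about continuous ancestral variation concrete, the sketch below works through the arithmetic of stratification bias for a marker with no causal effect. It is a toy model under loudly stated assumptions, not an analysis from this study: both the marker allele frequency and the disease risk are taken to vary linearly with an individual's ancestry fraction, and all parameter names and the function name are hypothetical.

```python
def case_control_allele_freqs(ancestry_dist, p0, p1, r0, r1):
    """Expected allele frequency among cases and among controls.

    ancestry_dist: list of (ancestry_fraction, population_weight) pairs,
                   with weights summing to 1.
    p0, p1: allele frequency at ancestry fraction 0 and 1 (assumed linear).
    r0, r1: disease risk at ancestry fraction 0 and 1 (assumed linear).
    The marker itself has no effect on disease in this model.
    """
    case_num = case_den = ctrl_num = ctrl_den = 0.0
    for a, w in ancestry_dist:
        p = p0 + a * (p1 - p0)   # allele frequency at this ancestry level
        r = r0 + a * (r1 - r0)   # disease risk at this ancestry level
        case_num += w * r * p
        case_den += w * r
        ctrl_num += w * (1.0 - r) * p
        ctrl_den += w * (1.0 - r)
    return case_num / case_den, ctrl_num / ctrl_den


# Hypothetical population: half with ancestry fraction 0.0, half with 0.4.
cases, controls = case_control_allele_freqs(
    [(0.0, 0.5), (0.4, 0.5)], p0=0.2, p1=0.8, r0=0.1, r1=0.3
)
```

In this toy setting the allele is more frequent among cases than among controls even though it plays no causal role, because cases are enriched for individuals with higher ancestry fractions; when `r0 == r1`, the two frequencies coincide.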

Our observations also emphasize the importance of
SIRE information: although statistical approaches using
genetic marker information recapture SIRE with high
accuracy, such analyses need to be guided by SIRE information.
The outcome of statistical cluster analyses
depends on the (relative and absolute) sample size of
the subgroups and on the homogeneity within groups
relative to distance between groups. Without proper control
of these nuisance factors, cluster analyses based
on genetic markers sometimes overlook important components
of population structure and at other times produce
artifactual clusters.

We note that the genetic cluster results indicate that
older geographic ancestry, rather than recent geographic
origin, is highly correlated with racial/ethnic categorizations
and, thus, is the major determinant of genetic
structure in the population. Although our results suggest
that genetic stratification may exist within racial/ethnic
groups (specifically, whites and African Americans sampled
from different geographic locations in the United
States), we found the differences based on current geography
to be quite modest. On the other hand, geographic
matching of Hispanic subjects is likely to be of
much greater importance, given the larger genetic differentiation
between Hispanic groups on the basis of
current geographic origins. In this study, we could not
evaluate this question directly, since Hispanics were recruited
only from a single site. Also, these geographic
analyses do not rule out other potential sources of confounding
within geographic regions for these groups
(for example, those based on specific ethnic affiliations),
which still may require attention.

Our results also suggested little confounding when
sampling cases and controls within SIRE and geographic
groups for studies of hypertension. We detected little, if
any, genetic differentiation at the 326 microsatellite
markers between hypertensive and normotensive subjects
in any of the ethnic groups we examined. However,
this topic merits additional scrutiny, in particular for
the admixed subjects (Hispanics and African Americans),
to determine whether cases and controls have
differential levels of admixture, which is likely to be the
greatest source of confounding for these populations (H.
Tang, personal communication).

In summary, from a very large study of four major
racial/ethnic groups within the United States and Taiwan,
we found extraordinary correspondence between SIRE
and genetic cluster categories but only modest geographic
differentiation within each race/ethnicity group.
This result indicates that studies using genetic clusters
instead of racial/ethnic labels are likely to simply reproduce
racial/ethnic differences, which may or may not
be genetic. On the other hand, in the absence of racial/ethnic
information, it is tempting to attribute any observed
difference between derived genetic clusters to a
genetic etiology. Therefore, researchers performing studies
without racial/ethnic labels should be wary of characterizing
difference between genetically defined clusters
as genetic in origin, since social, cultural, economic,
behavioral, and other environmental factors may result
in extreme confounding (Risch et al. 2002).